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We present a parallel FFT algorithm for SIMD systems following the 'Transpose Algo- 
rithm' approach. The method is based on the assignment of the data field onto a 1- 
dimensional ring of systolic cells. The systolic array can be universally mapped onto any 
parallel system. In particular for systems with next-neighbour connectivity our method 
has the potential to improve the efficiency of matrix transposition by use of hyper- 
systolic communication. We have realized a scalable parallel FFT on the APEIOO/Qua- 
drics massively parallel computer, where our implementation is part of a 2-dimensional 
hydrodynamics code for turbulence studies. 
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1. Introduction 

The efficient implementation of Fast Fourier Transforms (FFT) on parallel high 
performance computers is of tantamount importance in order to warrant broad 
scientific and industrial application of these systems. 

The ongoing development of new high speed low cost parallel computers with 
Teraflops performance, designed primarily to solve problems in lattice quantum field 
theory 1-4 , generates interest in other scientific and commercial fields with Tera- 
computing needs. There are quite a few such applications for which one would like 
to spend 10 to 1000 sustained Teraflopshours. However, low cost parallel systems go 
along with certain architectural restrictions as e.g. next-neighbour connectivity 5,6 . 
Numerical procedures like FFT that involve non-local inhomogeneous interproces- 
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sor communication patterns are therefore hard to implement efficiently on these 
machines. The communication bottleneck of FFT is even more serious on single 
addressing machines like APE100, referred to as SISAMD in the following , where 
the individual processing units do not provide indexed local addressing 1 . 

The substantial cost-factor of FFT on parallel computers arises from 'bit- 
inversion' of the location of each input data element and from the pair-wise deci- 
mation of the bit-inverse ordered array elements 7 . As a result, FFT is harassed by 
frequent non-contiguous memory accesses. On a standard computer, this fact im- 
plies repeated cache misses, while on a parallel machine it translates into non-local 
interprocessor communication. 

A well known strategy to carry out FFT on parallel systems for a dimensionality 
> 2 is based on the splitting of the d-dimensional FFT into d 1-dimensional FFT's 7 . 
This approach 6 is known as the 'Transpose Algorithm' 8 . Let us start from a d- 
dimensional data field with coordinates i\, 12, ■ ■ ■ , id, ik = 0, . . . , — 1. The field 
may be distributed such that one axis (coordinate ii) is spread across the processors, 
while the other axes are locally assigned to the processors. In this case, the 1- 
dimcnsional FFT can be computed on the local axes (coordinates 12, ■ ■ ■ ,id), i- e. 
for each coordinate ik in parallel without access to remote memory. However, as at 
least one axis has to be distributed across the processors, a reordering of the data 
field between distributed and local axes is required. Such a reordering of the data 
elements is equivalent to a matrix transposition. So the generic task is to devise an 
efficient Transpose Algorithm for distributed array layout 8 . 

In Ref. 9, a promising implementation of the Transpose Algorithm, adapted 
to mesh-based SIMD computers with next-neighbour connectivity, has been dis- 
cussed. A fc dimensional field of volume V is assigned to the processors within a 
d-dimensional toroidal mesh network. The transposition of the field is performed in 
a three-step procedure: first the data field is skewed (like rendering a matrix skew). 
This step is followed by interprocessor communication. The final step is an inverse 
skewing. 

The authors of Ref. 9 specify a pattern for the interprocessor communica- 
tion adapted to d-dimcnsional mesh architectures like APE. For the 3-dimensional 
connectivity of APEIOO/Quadrics they reached the gratifying result that the inter- 
processor communication can be performed with complexity of V/pz data commu- 
nication operations per processor, where V is the data volume. 

The two local skewing steps at the beginning and the end of the transposition 
have to be carried out according to the machine's addressing capabilities: on SIMD 
or MIMD machines, skewing is nothing but a local data assignment. An extra 
difficulty arises on SISAMD computers since they lack local addressing. The lack 
of local addressing at first sight renders the skewing phase non-scalable, i. e., each 
processor has to carry out its copy operations separately. This leads to an unac- 

"SISAMD: Single Instruction Single Adressing Multiple Data 

b A second standard procedure, the 'Bit-Exchange' algorithm, is better suited for hyper-cubic 
architectures than mesh computers. 
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ceptable performance degradation on SISAMD machines c . We will largely overcome 
this problem, by introducing a novel skewing algorithm that proceeds in \ogp steps. 

Starting out from Ref. 9, we construct a general transpose algorithm for FFT, 
that can be implemented in a scalable manner on any parallel architecture. For this 
purpose, we first associate the data field to a 1-dimensional ring of systolic cells 10 , 
that can perform both computations and communication (to neighbouring cells). 
Our systolic ring can be readily mapped onto any parallel machine's processor con- 
nectivity. One way to do this for APEIOO/Quadrics has been presented in Ref. 11. 
We can reduce the time expense of systolic interprocessor communication along the 
systolic ring by application of hyper-systolic data movement 12 ' 13 . Implemented on 
a (i-dimensional mesh, each processor has to perform 0(V/p~z~) communication 
operations on average. 

One particular target implementation system of our FFT will be the European 
Tcraflops system of 1998, APEmillc 2 . APEmille is designed as SIMD machine with 
next-neighbor connectivity. 

There are of course many Tera-computing applications where efficient parallel 
FFT algorithms will be of use, e. g. spectral methods in global climate modeling, 
turbulent fluid dynamics or quantum chromodynamics. The latter two will serve 
us as examples for illustration and benchmarking of the Transpose Algorithm. 

The implementation of a 1-dimcnsional FFT on APE which is important for 
signal processing applications will be presented in an upcoming publication 14 . 

2. Transpose Algorithm on a Systolic Ring 

We construct a Transpose Algorithm as implemented on a 1-dimensional systolic 
array with toroidal topology. 

2.1. Data assignment 

Let us consider a 2-dimensional data field (matrix) M of size p x p, where p is the 
number of processors of the parallel system. Each element mji, i = 0, . . . ,p — 1, 
j = 0, . . . ,p— 1, can be considered as sub-matrix. This structure will allow for easily 
generalization to data arrays of size V — m x n — y x ^ ®p xp with rectangular 
shape. 

In the following, we assume that the implementation network supports the 
embedding of a 1-dimcnsional ring topology. Along the ring, circular shifts in an 
homogeneous pattern can be performed. 

To achieve a universal Transpose Algorithm we are going to discuss the method 
as implemented on an array of systolic cells that can perform compute, store and 
communication operations. For illustration, in Fig. 1, a 1-dimensional ring with 
p = 8 cells is sketched. The cells are connected to their next neighbours by direct 
communication lines. The cells can perform communication operations to the next 

c Ref. 9 discusses the implementation of the Transpose Algorithm for Multi-controller APE systems 
that can partly overcome the SISAMD restrictions. 
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Figure 1 : Systolic array of 8 systolic cells forming a systolic ring. 



neighbours in a homogeneous pattern as well as synchronous compute operations 
with equal load, on cell resident data. This systolic array concept is simple enough 
to be mapped efficiently onto practically any parallel architecture and thus is the 
adequate framework for the construction of a "universal" Transpose Algorithm. 

The initial data assignment on the systolic ring is given in Fig. 2. The data 
elements are indexed following standard matrix notation. Note that matrix elements 
with equal second coordinate i are all assigned to cell i, while their first coordinate 
refers to their local address. 
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Figure 2: Assignment of the p x p data array M to the p-cell ring. 
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Figure 3: Distribution of a 2-dimensional pxp data array M on a p-ce\\ ring in skew 
order. 



2.2. Transposition 

The transposition of M proceeds within 3 steps, a pre-skewing step, that does not 
involve interprocessor communication, a row-shifting step, requiring intcrprocessor 
communication and finally a re-skewing step, again without interprocessor commu- 
nication: 

1. Pre-Skewing. The matrix M is locally skewed along the columns. The i-th 
column is shifted upward by a circular shift of stride i. After finishing this process 
for all columns i, the diagonal elements of the original matrix are located in the 
first row, see Fig. 3. 

The skewing procedure is given in the pseudo-code representation of Algo- 
rithm 1. 



Algorithm 1 Pre-Skewing. 




forcach cell i = 


: p — 1 g systolic array 


for j = : p — 


1 


m jti = rn [{j 


fi)mod p ],i 


end for 




end foreach 





Note, that in each step different cells act on different internal addresses. As 
all cells of a systolic array perform equal operations per definition, skewing is not a 
systolic process! However, we will demonstrate in appendix A how skewing can be 
performed in form of a systolic process, within p log p steps. Therefore, by mapping 
of the systolic array onto a SISAMD systems we are able to perform skewing in a 
scalable manner. Of course on SIMD and MIMD systems skewing is trivial as it 
can be carried out by usage of local addressing facilities. 
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2. Row-shifting. In the second phase, the rows of M are shifted along the ring. 
After the shift, the rows are assigned to a local memory location determined as 
follows: the j-th row is circularly shifted by a distance of j cells to the right and 
subsequently is copied from storage location j to storage location (p — j)mod p . The 
procedure is given in Algorithm 2. 



Algorithm 2 Row-Shifting. 




forcach cell 


i = : p — 1 € systolic array 


for j = 


:p-l 


= 


rn [(p-j)mod p ,(i+j)mod p ] 


end for 




end forcach 





The corresponding data locations after completion of phase 2 are illustrated in 
Fig. 4. 
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Figure 4: p x p data array M after row-shifting. 



We emphasize that in any step, each cell acts on the same local addresses. 
Therefore, the row-shifting is a systolic process and, mapping the systolic array onto 
the processors of a parallel computer, local address computations are not required. 



3. Re-Skewing. In a last step, the matrix M is skewed downwards, in analogy 
to the pre-skewing step. The i-th column is shifted by a circular shift of stride —i. 
The final data arrangement is depicted in Fig. 5. 
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Figure 5: Transposed 2-dimcnsional p x p data array M on a p-cell ring. 



The pseudo-code representation of the re-skew step is given in Algorithm 3. 
Algorithm 3 Re-Skewing. 

foreach cell i = : p — 1 € systolic array 
for j = : p — 1 

m 3,i = m [(j-,+ p )mod p ,,] 
end for 
end foreach 

For FFT in more than 1 dimensions, two such transposition have to be carried 
out. In Algorithm 4, we give a pseudo-code template for parallel FFT in two 
dimensions: 

Algorithm 4 2-dimcnsional parallel FFT. 

call 1-d-FFT ! on local axis 
call transpose 

call pre-skew 

call row-shift 

call re-skew 
call 1-d-FFT ! on local axis 
call transpose 

call pre-skew 

call row-shift 

call re-skew 



For the description of FFT we refer to Rcf. 7. 
3. Mapping on APEIOO/Quadrics 

The ring of systolic cells together with their functionality can readily be mapped 
onto the processors of a parallel machine that supports a 1-dimensional ring topol- 
ogy. We are going to describe the mapping for APEIOO/Quadrics. 
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3.1. The APEIOO/Quadrics parallel system 

The parallel system APE100 has been designed primarily for quantum chromody- 
namics 17 computations on a lattice. Its commercial version goes under the name 
Quadrics. The diagram Fig. 6 depicts the Quadrics system: compilation and pro- 
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CPU 




Figure 6: The Quadrics system. 



gram submission as well as I/O tasks are carried out on a front-end workstation. 
The so-called Z-CPU manages the instruction stream (synchronous for all comput- 
ing nodes) and performs global integer arithmetics. 

The programming language TAO looks FORTRAN like. It supports so-called 
syntagmas, structs which are very useful for efficient programming. TAO supports 
a local address space whereas communication from and to remote memories uses 
additional global address bits. 

The compute nodes perform the local floating point operations. They are 
composed of memory, a floating point unit called MAD, and the communication 
chips. The MAD is able to perform 2 floating point operations within one clock 
cycle, that is (25MHz) _1 . Thus, one processor has a peak speed of 50 Mflops. 

The smallest configuration of the Quadrics computers is a Ql, a unit of 8 
compute nodes, called a board. The nodes are connected by 6 communication links 
to their neighbor nodes in a 3-dimensional grid, see Fig. 6. Larger machines like 
the 32-node Q4 or the 512-node QH4 are composed of several boards. The network 
topology of the large system is hard-wired. A 512-node Quadrics QH4 is typically 
configured as an 8 x 8 x 8-grid, toroidally closed in each direction. 

In order to transfer one 32-bit word from the memory of the neighboring pro- 
cessor to the memory of the local processor, apart from start-up times, 8 clock 
cycles are needed in the ideal case. In practical applications, this limit can not be 
reached completely, however. 

APEIOO/Quadrics is the European pathway towards Tera-computing. The 
APE groups in Pisa and Roma are currently developing their next generation of 
SIMD machines, the parallel APEmille systems. This machine will exhibit a com- 
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pute performance of more than 1 Teraflops. The commercial version of this machine 
will go under the name PQE2000. This machine will have a parallel MIMD front-end 
system. 

The APEmillc architecture is extremely similar to APE100 (same topology 
and interconnectivity) . Improvements, beyond much higher performances, include 
integer and double precision arithmetic. 

As the most important extension to the present APE100 architecture all pro- 
cessing nodes may access their own local memories using different local addresses. 
Thus, APEmille will overcome the local addressing restrictions of APE100, allowing 
to implement algorithms that could not scale on APE100 machines. 

A second advantage is the more general long-distance routing capability. In- 
deed, rigid communications can be set-up at arbitrary distances and automatically 
handled by the network (cut-through-routing). Again, this simplifies coding of 
the systolic moves and reduces communication latency. Furthermore, the APE 100 
software will be upward compatible for APEmillc, and FFT-packs developed for 
APE100 can readily be ported onto APEmillc. 

3.2. Embedding a 1- dimensional ring on Quadrics 

As has been shown in detail in Refs. 11 and 16, the 3-dimensional network of the 
Quadrics can be exploited to emulate the 1-dimensional ring functionality and to 
carry out shifts of any stride. 

On a 3-dimensional grid based machine, the shifts can be described by a se- 
quence of hardware communication operations in each of the three directions. We 
encode the stride a; in terms of the distance between processor and processor I 
on the systolic ring, by 

l = h + hP2 + hV2Pi=au (1) 
with pi being the number of processors in direction i, and k the coordinates of 
processor I in direction i on the 3-dimensional processor grid, the total number 
of processors being p = poPiP2- The cost function C(ai) in units of hardware 
communication operations is given as 

C{a t ) = lo + h+h- (2) 

The total cost function C t is the sum of the individual C(ai) over all circular shifts 
with stride o;, a; = 0, . . . ,p — 1: 

P2-1 pi— lpo — i 

c t = Y.Y. E( /2 + 1 + z i + 1 + Zo + 1 ) = f^°+pi+P 2 + 3 )- ( 3 ) 

; 2 =o Ji=o 2 =o 

In practise, one can exploit hardware communication for each axis in both 
directions. Hence, the average cost per shift can be reduced by a factor of 2, and 
on a 3-dimensional machine with next-neighbour connectivity the wall clock time 
for interprocessor communication decreases like 

U P c oc -(po + pi +P2 + 6), (4) 
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for a fixed size V of the data array. 

4. Benchmarking the Transpose Algorithm on APEIOO/Quadrics 

We have mapped the FFT based on the systolic Transpose Algorithm onto several 
Quadrics parallel SISAMD computers. Here we present timing results and scaling. 
In the next section we discuss two physics case studies where FFT can play an 
important role. 

For a 32-node APEIOO/Quadrics Q4 system and a 512-node QH4 system we 
give detailed timing results. We plot the total execution times against the matrix 
size V, cf. Fig. 7. On both machines the asymptotic behaviour is reached when 
the length of the stream set involved in row-shifting is larger than 32 data elements. 
For stream smaller than 32 elements we are faced with a start-up time effect. The 
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Figure 7: Double log plot of transposition vs. matrix size V on a 32-node Quadrics 
Q4 and a 512-node Quadrics QH4. 

two curves intersect each other for about n = 1024. 

In order to test the scaling law, Eq. 4, we plot the times for interprocessor 
communication for a fixed matrix size of 2048 x 2048 complex elements on various 
Quadrics platforms, against the number of processors (see Fig. 8). A fit to the 
data according to t oc p a reveals an exponent a of —0.76. The naive expectation 
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from Eq. 4 for cubic machine topologies would be —0.66. However, the actual 
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1000 i — . . . <—> 
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Figure 8: Scaling of the Transpose Algorithm on 32-node to 512-node Quadrics 
systems for fixed problem size (V = 2048 x 2048 elements). 

topologies were V( p=512 ) =8x8x8, V( p=2 5 6 ) =8x8x4, V( p=128 ) =8x4x4, and 
V (p = 32 ) =8x2x2. 

5. Applications of FFT on APEIOO/Quadrics 

The work described in the previous section has been motivated by the need to have 
a fast FFT in some physics applications of ours. The solution of these problems 
can benefit considerably from the power of APE computers. 

5.1. 2- dimensional FFT: Kraichnan model 

In order to simulate the Kraichnan model 15 , an efficient 2-dimcnsional FFT trans- 
form was required. The Kraichnan model is a toy model which has attracted much 
interest recently, because of its analytical simplicity with respect to the full Navier- 
Stokes equations. An often studied problem of turbulence is the anomaly of the 
scaling exponents in the inertial range. In order to get reliable numerical values for 
these exponents, huge computer simulations are required. 

This toy model describes a passive scalar T(x, t) which moves in a velocity field 
which is synthetically generated with a given scaling exponent £. One studies the 
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scaling exponent £t of the passive scalar as a function of £. The relevant equations 
are: 



The random force /(x, t) is delta correlated in time and Gaussian with corre- 
lator C (a function with support of the order of size of the system) in space. The 
velocity field is u, its statistical properties are given by Eq. 5 and, as it is clear from 
the expression for D a @, the incompressibility condition is fulfilled. The exponent £ 
is just the scaling exponent imposed on the velocity field (relevant values are in the 
range [0,2]). 

The main practical problem for the simulation of a system of this kind are the 
huge size of the system (i. e. 8192 x 8192) and the fact that a new velocity field is 
required at each time step. 

The velocity field with the required statistical properties given by Eq. 5 is easily 
generated in Fourier space just by extracting random gaussian amplitudes (for the 
k modes) with given variance ~ fc _1_ £/ 2 . The direction of the k-th. component uj, is 
choosen such that k • u,t = 0. In addition, with FFT, periodic boundary conditions 
are automatically fulfilled. 

Alternative algorithms could be used for the generation of the velocity field, 
but, since the required statistical property is non-local, they would face the non- 
locality problem anyway and communication between different processors would 
become very heavy. In the case of our problem, a further major performance gain 
can be achieved in a very simple way. Since we need a specific geometric mapping 
of the variables only in real space, we can skip a transposition step by generating 
transposed amplitudes in k space. This is of course true in several interesting cases, 
when we need to go to k space in order to make some computation and then we 
want to come back to real space. 

The FFT which has been described in this article has been implemented on 
a 512-nodc QH4 machine and has been optimized for working with very large lat- 
tice sizes (i.e. 8192 x 8192 elements). We are able to compute an FFT on the 
corresponding date set of about 512 Mbyte in 4 seconds. This means that we are 
working at about 9% of the peak performance for the above problem. If we perform 
the complete FFT including the back-transposition, we achieve about 6 % of the 
peak performance. 



d t T + (u • V) T = X V 2 T + / 
(f( X ,t)f(y,t'))=6(t-t')C(\x-y\) 
<i>i(x, t) Vj (y, t')) = D^(x - y)S(t - t>). 



(5) 



D a0 (x) = D(O)S a0 -d a0 (x) 
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5.2. 4- dimensional FFT: Quantum chromodynamics 

In principle, the above Transpose approah to parallel FFT could be applied for 
systems of any dimensionality > 2. However, from a practical point of view, in the 
implementation of FFT on parallel systems we meet rather different situations: on 
a low dimensional lattice, the length of the axes obviously can be much larger than 
that of a high dimensional lattice. However, the length of at least two axes must 
be as large as the length of the systolic ring, otherwise we would have to pad with 
zeroes and would loose efficiency 

This situation occurs e. g. in simulations of lattice quantum chromodynamics 17 
(QCD). QCD is formulated on a 4-dimensional euclidean space-time lattice. Some 
aspects of QCD simulations, like e.g. gauge fixing 18 , Fourier acceleration of Krylov 
solvers 19 or accelerated updates might involve FFT 20 . 

We can not go into detail of lattice QCD here, but rather discuss the FFT 
implications of a 4-dimcnsional field f(xo,xi,X2,X3). This field is assigned onto a 
finite 4-dimensional lattice of size xq = 0, . . .po — 1, x\ = 0, . . .p\ — 1, X2 = 0, . . .p2~ 
1, and x 3 = 0, . . .p3 — 1. Typical sizes of QCD lattices range between 16 4 and 64 4 
sites, hence on massively parallel systems with p > 100, the direct application of 
the approach described above would not be adequate for these applications. 

However, the systolic FFT idea can be readily generalized: In the case of the 2- 
dimensional FFT on a 3-dimensional grid computer, one axis was distributed while 
the other was local on the systolic ring. Handling the 4-dimensional FFT on a 
3-dimcnsional grid computer one has more than one option. 

A very natural choice is to stick to the lexicographic data layout of the 4- 
dimensional QCD lattice. This leads to introduce many systolic rings along the grid 
axes of a given dimension. On a 3-dimensional grid the set of axes along a given 
direction is specified by fixing two grid coordinates orthogonal to the direction, 
say coordinates xq and x\. Then, we can consider the axes specified by xq and 
Xi of our 4-dimensional system as a number of poPi systolic rings. On each ring, 
coordinate x 2 of the 4-d system is laid out while x% is local. Therefore, each of 
the rings corresponds to the situation we have encountered in the above case of 
2-dimensional FFT. The Transpose Algorithm intorduced above applies readily to 
this situation. 

The order of local axes 4-dimensional FFT is schematically illustrated in Fig. 9. 
Initially, we perform the FFT along the third dimension, regarded as local at this 
time. Next, we transpose axis 2 with axis 3 and perform the FFT along the second 
axis being local now. This transposition is carried in parallel for all coordinates 
x and x\. It is obvious how the process can be continued. In the second step, 
we fix dimension and 3 and transpose axis 1 with 2. Hence we can FFT axis 1, 
being local now. In the third step, we fix dimension 2 and 3, transpose axis and 
1 and can perform FFT on axis being local now. Finally, we transpose axis 3 
with and re-arrive in the original data layout. Algorithm 5 gives the pseudo-code 
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Figure 9: Schcmatical view of 4-dimensional FFT on a 3-dimcnsional processor grid. 



representation for the ensuing algorithm. 



Algorithm 5 4-dimcnsional parallel FFT on 3-dimcnsional processor grid. 

call l-d-FFT(axis 3) 

call transposc(axis 2 and 3) 

call l-d-FFT(axis 2) 

call transposc(axis 1 and 2) 

call l-d-FFT(axis 1) 

call transposc(axis and 1) 

call l-d-FFT(axis 0) 

call transpose(axis 3 and 0) 



6. Summary 

We have presented a novel scalable parallel algorithm for matrix transposition and 
its application within FFT. Universality of the method is achieved by assignment 
of the data onto an abstract systolic ring that can be readily mapped onto any 
parallel system. For mesh systems the use of the hyper-systolic communication 
structure improves the communicational complexity considerably. In our scheme we 
can avoid computation of local addresses. Thus, in particular for SISAMD systems 
like APEIOO/Quadrics, the implementation of efficient FFT became feasiblle. 

We have implemented the FFT scheme within a 2-dimcnsional simulation of the 
Kraichnan model for passive advection. Further, we suggested an implementation 
of 4-dimensional FFT applied within lattice quantum chromodynamics. In the first 
application the FFT performs with an efficiency of about 9 % of the peak performace 
of the system, in a general application we reach about 6 %. 

On the European Teraflops system APEmille to come there is no restriction as 
to local addressing. Pre-skewing and Re-skewing will benefit mostly. Coding will 
be straightforward and the computation significantly faster (about an order of mag- 
nitude for a 512 node machine compared to on APEIOO/Quadrics). On APEmille, 
we expect to achieve a performance for FFT of about 15 % of the theoretical peak 
performance. 

To set the numbers achieved on APE100 and expected on APEmille into per- 
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spective we can compare with results from general purpose parallel machines. On 
the Cray T3E for an 8 node partition, the 3-dimcnsional FFT on complex data per- 
forms with 10 % of the peak performance using the optimized scientific library 21 . 
On the Hitachi SR2201, a 8fc x 8fc complex data array is Fourier transformed with 
a speed of 30 % of the machine's peak performance 22 . 

Currently we are working on an efficient realization of 1-dimensional FFT. 
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Appendix 

A: Skewing on SISAMD Computers 

The first and last phases in the matrix transposition presented above involve the 
representation of M in skew order. On systems with indexed addressing facilities 
skewing is just a local address computation. However, for SISAMD systems like 
APEIOO/Quadrics, skewing is more difficult as its naive implementation spoils scal- 
ability. 

Algorithm 6 Skewing for SISAMD systems. 

foreach cell i = : p — 1 € systolic array 
for q = : I - 1 
g = h mod2? 
where g q > 2 q 
for j = : p — 1 

m 3,i = m [(j + 29),i] 

end for 
endwhere 
end for 
end foreach 

In order to solve this problem we have developed a systolic skewing algorithm 
that avoids indexed addressing and shows logarithmic complexity. Thus the method 
is well suited for SISAMD systems. 

We start from the initial data configuration as illustrated in Fig. 2. Let p 
be a number 2 l , I e N, and let h number the cells from to p — 1 along the 
systolic ring. Taking g q = /imod 2 ?, q = 0, . . . , I — 1, we can partition the cell array 
successively in | |, . . ., 2 parts of equally numbered sequences. In each step we can 
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impose a conditional statement that is true where g q > 2 q . There a local memory 
shift operation by a stride of 2 q is executed. After 1 such steps, the data array is 
represented in skew order. The process is illustrated as pseudo-code in Algorithm 6. 

B. Skewing by Translation Invariance 

An alternative to pre and re-skewing as presented above can be constructed by 
usage of the translation invariance of the system. The task is to calculate a Fourier 
transformed function fj (we consider one dimension) that is computed from the real 
space function fc, i = 0, . . . , N — 1 which is a vector /$ = 9(^ + k)mod N , translated 
by k elements. The Fourier transform 

= exp ^ (6) 

»=0 

can be written in terms of g: 

f 2m ji 

h — 2—t %+fc)modiv exp 

2 = 

2mjk%—^ 2mj(i + k) 
= exp — ^ 9 {l+k)modN exp 

i=0 

2m jk 2m jl 

= exp ^ gi exp—. (7) 

;=o 

The translation in real space amounts to the application of a mode dependent 
phase factor in Fourier space. Similarly, a translated left hand side is computed 
using corresponding phase factors. 

A 2-dimensional FFT as given in Algorithm 4 can be performed on SISAMD 
systems omitting 3 skewing steps. However, one re-skewing step remains: 

Algorithm 7 2-dimcnsional parallel FFT using translation invariance. 

call l-d-FFT-pre-skew 
call row-shift 

call 1-d-re-skew-FFT-pre-skew 
call row-shift 
call re-skew 



We note that in the actual implementation of this method one has to carefully 
address precision issues. 
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